Abstract

We consider variations on the following problem: given an NFA $M$ over the alphabet
$\Sigma$ and a pattern $p$ over some alphabet $\Delta$, does there exist an $x \in L(M)$ such that $p$
matches $x$? We consider the restricted problem where $M$ only accepts a finite language.
We also consider the variation where the pattern $p$ is required only to match a factor
of $x$. We show that both of these problems are NP-complete. We also consider the
same problems for context-free grammars; in this case the problems become
PSPACE-complete.



1 Introduction 

The computational complexity of pattern matching has received much attention in the lit- 
erature. Although determining whether a given word appears inside another can be done 
in linear time, other pattern-matching problems appear to be computationally intractable. 
In a classic paper, Angluin [2] showed the problem of determining if an arbitrary pattern 
matches an arbitrary string is NP-complete. More recently, Anderson et al. [1] showed that 
pattern matching becomes PSPACE-complete if we are trying to match a pattern against 
words of a language specified by a DFA, NFA, or regular expression. 
In this paper we consider some variations on the pattern matching problem. We begin
by fixing our notation. 

Let $\Sigma$ be an alphabet, i.e., a nonempty, finite set of symbols (letters). By $\Sigma^*$ we denote
the set of all finite words (strings of symbols) over $\Sigma$, and by $\epsilon$ the empty word (the word
having zero symbols). If $w = xyz$, then $y$ is said to be a factor of $w$.

Let $k \geq 2$ be an integer. A word $y$ is a $k$-power if $y$ can be written as $y = x^k$ for some
non-empty word $x$. A 2-power is called a square. Patterns are a generalization of powers.
A pattern is a non-empty word $p$ over a pattern alphabet $\Delta$. The letters of $\Delta$ are called
variables. A morphism is a map $h : \Sigma^* \to \Delta^*$ such that $h(xy) = h(x)h(y)$ for all $x, y \in \Sigma^*$; a
morphism is non-erasing if $h(a) \neq \epsilon$ for all $a \in \Sigma$. A pattern $p$ matches a word $w \in \Sigma^*$ if
there exists a non-erasing morphism $h : \Delta^* \to \Sigma^*$ such that $h(p) = w$. Thus, a word $w$ is a
$k$-power if it matches the pattern $a^k$.
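To make the definition of matching concrete, here is a minimal backtracking sketch in Python. It is an illustration only, not an algorithm taken from the paper: the function name `matches` and the representation of patterns and words as Python strings are assumptions made for this example. It runs in exponential time in the worst case, which is consistent with the NP-completeness result of Angluin cited above.

```python
def matches(pattern, word, h=None):
    """Return a non-erasing morphism h (as a dict) with h(pattern) == word, or None.

    pattern: string whose characters are the variables.
    word:    string over the terminal alphabet.
    """
    h = {} if h is None else h
    if not pattern:
        # The whole pattern is consumed; succeed only if the word is too.
        return dict(h) if not word else None
    var, rest = pattern[0], pattern[1:]
    if var in h:
        image = h[var]
        if word.startswith(image):
            return matches(rest, word[len(image):], h)
        return None
    # Unbound variable: try every non-empty prefix of the word as its image.
    for cut in range(1, len(word) + 1):
        h[var] = word[:cut]
        found = matches(rest, word[cut:], h)
        if found is not None:
            return found
        del h[var]
    return None
```

For instance, `matches("aa", "abab")` finds the morphism sending `a` to `ab`, which certifies that `abab` is a square.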

As mentioned above, Anderson et al. [1] proved that the following problem
is PSPACE-complete.

DFA/NFA PATTERN ACCEPTANCE 

INSTANCE: A DFA or NFA $M$ over the alphabet $\Sigma$ and a pattern $p$ over some
alphabet $\Delta$.

QUESTION: Does there exist $x \in L(M)$ such that $p$ matches $x$?

In this paper we consider variations on this problem. We consider the restricted problem 
where the input machine only accepts a finite language. We also consider the variation where 
the pattern p is required only to match a factor of x. We show that both of these problems 
are NP-complete. We also consider the same problems for context-free grammars; in this 
case the problems become PSPACE-complete. 

2 Detecting patterns in finite regular languages 

We first recall the DFA INTERSECTION problem. This problem is well-known to be
PSPACE-complete [3, Problem AL6].

DFA INTERSECTION 

INSTANCE: DFAs $A_1, A_2, \ldots, A_k$, each over the alphabet $\Sigma$.

QUESTION: Does there exist $x \in \Sigma^*$ such that $x$ is accepted by each $A_i$,
$1 \leq i \leq k$?

Theorem 1. The DFA INTERSECTION problem is NP-complete if the input DFAs only 
accept finite languages. 

Proof. We first show that the problem is in NP. Suppose that each $A_i$ has at most $n$ states.
If $A_i$ accepts a finite language then it only accepts strings of length less than $n$. In particular
any string accepted in common by all the $A_i$'s has length less than $n$. We can therefore guess
such a string in $O(n)$ time, and check if it is accepted by all of the $A_i$'s in $O(kn)$ time.
We show NP-completeness by reducing from 3-SAT. Let $\varphi$ be a boolean formula in
3-CNF, with variables $V_1, V_2, \ldots, V_n$ and clauses $C_1, C_2, \ldots, C_m$. We let a binary string of
length $n$ uniquely encode a truth assignment of the variables, where 1 denotes true and
0 denotes false. For each $1 \leq i \leq m$, we construct a small DFA accepting exactly the strings
of length $n$ that encode an assignment of the variables $V_1, V_2, \ldots, V_n$ that satisfies clause $C_i$.
For example, if $C_i = V_1 \vee \overline{V_2} \vee V_4$, then our DFA would accept the strings

$1\{0,1\}^{n-1} \cup \{0,1\}0\{0,1\}^{n-2} \cup \{0,1\}^3 1\{0,1\}^{n-4}$.

Such a DFA can be constructed using at most $2n + 1$ states. The total number of DFAs is the
total number of clauses, and their intersection is nonempty if and only if there is a satisfying
assignment to $\varphi$. The construction can clearly be carried out in polynomial time. □
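The clause automata used in this reduction can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' construction verbatim: clauses are encoded as tuples of nonzero integers (`+i` for $V_i$, `-i` for its negation), states are pairs (position, satisfied-so-far), a missing transition means rejection, and the names `clause_dfa` and `common_word` are hypothetical. The search over the product automaton is only meant to check small examples; it is not the NP algorithm from the proof.

```python
def clause_dfa(n, clause):
    """DFA accepting the length-n binary strings that satisfy one clause.

    clause: iterable of nonzero ints; +i stands for V_i, -i for its negation.
    """
    # (variable index, bit) pairs that make the clause true.
    wanted = {(abs(lit), "1" if lit > 0 else "0") for lit in clause}
    delta = {}
    for pos in range(n):
        for sat in (False, True):
            for bit in "01":
                hit = (pos + 1, bit) in wanted
                delta[(pos, sat), bit] = (pos + 1, sat or hit)
    # No transitions leave position n, so only strings of length n are accepted.
    return {"start": (0, False), "accept": {(n, True)}, "delta": delta}


def common_word(dfas):
    """Search the product automaton for a word accepted by every DFA."""
    start = tuple(d["start"] for d in dfas)
    frontier, seen = [(start, "")], {start}
    while frontier:
        states, word = frontier.pop()
        if all(s in d["accept"] for s, d in zip(states, dfas)):
            return word
        for bit in "01":
            nxt = tuple(d["delta"].get((s, bit)) for s, d in zip(states, dfas))
            if None not in nxt and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, word + bit))
    return None
```

A word returned by `common_word` is a satisfying assignment read as a bit string; `None` means the clauses have no common satisfying assignment.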

Theorem 2. The NFA PATTERN ACCEPTANCE problem is NP-complete if the input 
NFA M accepts a finite language. The problem remains NP-complete if M is deterministic. 

Proof. If $M$ has $n$ states, then $M$ accepts no word of length $n$ or more; if it did, then by the
pumping lemma $M$ would accept infinitely many words. We can therefore solve the pattern
acceptance problem by guessing a word $x$ in $L(M)$ of length less than $n$ and a morphism $h$
(also of bounded size) and verifying that $h(p) = x$ in polynomial time.

To show that the problem is NP-hard, we apply Theorem 1. Given DFAs $A_1, A_2, \ldots, A_k$,
each over the alphabet $\Sigma$, and each accepting a finite language, we construct a DFA $M$ to
accept

$L(A_1)\# L(A_2)\# \cdots L(A_k)\#$,

where $\#$ is not in $\Sigma$. The DFA $M$ accepts a $k$-power (i.e., a word matching the pattern $a^k$)
if and only if the intersection of the $L(A_i)$'s is non-empty. □
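The equivalence behind this reduction can be checked on small examples. The sketch below is an illustration only: it represents each finite language as an explicit Python set of strings rather than as a DFA, and the helper names are hypothetical.

```python
from itertools import product


def concatenations(languages, sep="#"):
    """All words u_1 # u_2 # ... u_k # with u_i taken from the i-th finite language."""
    return {"".join(u + sep for u in choice) for choice in product(*languages)}


def is_k_power(word, k):
    """True if word == x repeated k times for some non-empty x."""
    if not word or len(word) % k:
        return False
    root = word[: len(word) // k]
    return root * k == word


def has_k_power(languages):
    """Does the separated concatenation contain a k-power, k = number of languages?"""
    k = len(languages)
    return any(is_k_power(w, k) for w in concatenations(languages))
```

Because the separator occurs exactly $k$ times in every word, a $k$-power must have the form $(u\#)^k$ with $u$ in every language, so `has_k_power` agrees with testing whether the intersection is non-empty.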

Theorem 3. The problem "Given an NFA $M$ and a pattern $p$, is there a non-erasing
morphism $h$ and a word $w$ in $L(M)$ such that $h(p)$ is a factor of $w$?" is NP-complete. The
problem remains NP-complete if $M$ is deterministic.

Proof. To see that it is in NP, note that the answer is always "yes" if $L(M)$ is infinite (because
then by the pumping lemma $L(M)$ contains $xy^*z$ for some $x, y, z$, and if it contains arbitrarily
high powers of $y$ then it contains any pattern as a factor). We can check if $L(M)$ is infinite
in polynomial time. Otherwise, the size of the morphism $h$ is bounded, and we can guess
both $h$ and $w$ in polynomial time and verify that $h(p)$ is a factor of $w$.

To see that the problem is NP-hard, we give a reduction from 3-SAT that is a simple
modification of the construction of Angluin [2, Theorem 3.6]. Let $\varphi$ be a boolean formula in
3-CNF, with variables $V_1, V_2, \ldots, V_n$ and clauses $C_1, C_2, \ldots, C_m$. We define a pattern $p$ with
variables $x_i, y_i$, $1 \leq i \leq n$, and $z_j, u_j$, $1 \leq j \leq m$, and $v$.

For $1 \leq j \leq m$ and $1 \leq k \leq 3$, define

$f(j, k) = x_i$ if the $k$'th literal in $C_j$ is $V_i$,
$f(j, k) = y_i$ if the $k$'th literal in $C_j$ is $\overline{V_i}$.
Given $\varphi$, we define

$p = v^{2n+6m}\, v x_1 y_1\, v x_2 y_2\, v \cdots v x_n y_n\, v q_1\, v q_2\, v \cdots v q_m\, v z_1 u_1\, v z_2 u_2 \cdots v z_m u_m\, v\, v^{2n+6m}$

where

$q_j = f(j,1) f(j,2) f(j,3) z_j$,

for $1 \leq j \leq m$, and

$w = 0^{2n+6m} (01^3)^n (01^7)^m (01^4)^m\, 0\, 0^{2n+6m}$.

We claim that $p$ matches a factor of $w$ if and only if $p$ matches $w$ exactly if and only if $\varphi$ is
satisfiable. This can be established by an argument almost identical to that of Angluin [2,
Theorem 3.6]. The only difference is that we have added $v^{2n+6m}$ to the beginning and end
of $p$ and $0^{2n+6m}$ to the beginning and end of $w$ in order to enforce that $p$ matches a factor of
$w$ if and only if it matches $w$ itself.

Now we let $M$ be the $(|w| + 2)$-state DFA that accepts the single word $w$. Given a 3-CNF
formula $\varphi$ we can create a DFA $M$ and pattern $p$ such that $\varphi$ is satisfiable if and only if $p$
matches the single string accepted by $M$. □
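The shape of the pattern and word in this reduction can be illustrated with a short Python sketch. Several points are assumptions made for the example rather than statements from the paper: clauses are tuples of nonzero integers (`+i` for $V_i$, `-i` for its negation), the pattern is a list of variable names, and a true literal is encoded by an image of length 2 and a false one by an image of length 1. The sketch only demonstrates one direction: a satisfying assignment yields a morphism $h$ with $h(p) = w$.

```python
def build_instance(n, clauses):
    """Return (p, w): p is a list of variable names, w is a binary string."""
    m = len(clauses)
    pad = 2 * n + 6 * m
    p = ["v"] * pad
    for i in range(1, n + 1):
        p += ["v", f"x{i}", f"y{i}"]
    for j, clause in enumerate(clauses, 1):
        literals = [f"x{l}" if l > 0 else f"y{-l}" for l in clause]
        p += ["v"] + literals + [f"z{j}"]          # v q_j
    for j in range(1, m + 1):
        p += ["v", f"z{j}", f"u{j}"]
    p += ["v"] + ["v"] * pad
    w = "0" * pad + "0111" * n + "01111111" * m + "01111" * m + "0" + "0" * pad
    return p, w


def morphism_from_assignment(n, clauses, assignment):
    """Build h from a truth assignment {i: bool}; None if some clause is false."""
    h = {"v": "0"}
    for i in range(1, n + 1):
        h[f"x{i}"] = "11" if assignment[i] else "1"
        h[f"y{i}"] = "1" if assignment[i] else "11"
    for j, clause in enumerate(clauses, 1):
        used = sum(len(h[f"x{l}" if l > 0 else f"y{-l}"]) for l in clause)
        z_len = 7 - used
        if not 1 <= z_len <= 3:   # u_j would have to be empty: no true literal
            return None
        h[f"z{j}"] = "1" * z_len
        h[f"u{j}"] = "1" * (4 - z_len)
    return h


def image(h, p):
    """Apply the morphism h to the pattern p."""
    return "".join(h[var] for var in p)
```

The block $01^4$ forces $z_j$ to have length at most 3, so the three literal images in $q_j$ must together have length at least 4, which under the encoding above means that at least one literal of $C_j$ is true.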



3 Detecting patterns in finite context-free languages 

We now consider the pattern acceptance problem for context-free languages. 

Theorem 4. The following problem is undecidable: "Given a CFG $G$, does $G$ generate a
square?"

Proof. We reduce from the Post correspondence problem. Given an instance of Post
correspondence, say $(x_1, y_1), \ldots, (x_n, y_n)$, we create a CFG $G = (V, \Sigma', P, S)$ as follows: we introduce
$n + 1$ new symbols $\#, c_1, c_2, \ldots, c_n$ not in $\Sigma$, and let $\Sigma' = \Sigma \cup \{\#, c_1, c_2, \ldots, c_n\}$. Also let
$V = \{A, B, S\}$, and let $P$ be the set of productions

$S \to A\#B\#$

$A \to x_i A c_i \mid x_i c_i$, $1 \leq i \leq n$,

$B \to y_i B c_i \mid y_i c_i$, $1 \leq i \leq n$.

We claim $L(G)$ contains a square $xx$ if and only if the PCP instance $(x_1, y_1), \ldots, (x_n, y_n)$ has a solution.

□ 

The previous problem clearly becomes decidable if G only generates a finite language. 
Next we consider the computational complexity of this restricted version of the problem. We 
first consider the problem of deciding whether two PDAs, each accepting a finite language, 
accept some word x in common. 

Let $M$ be a TM and let $w$ be an input to $M$. Let us suppose without loss of generality
that all halting computations of $M$ on $w$ take an even number of steps. It is well-known
(e.g., [4, Lemma 8.6]) that one can construct context-free grammars $G_1$ and $G_2$ to generate
languages $L_1$ and $L_2$ consisting of words of the form $c_1 \# c_2^R \# \cdots \# c_{k-1} \# c_k^R \#$, where
• Each $c_i$ encodes a valid configuration of $M$.

• In $L_2$, the word $c_1$ encodes the initial configuration of $M$ on input $w$ and the word
$c_k$ encodes a valid accepting configuration of $M$.

• In $L_1$ (resp. $L_2$), the configuration $c_{i+1}$ follows from configuration $c_i$ according to the
transition function of $M$ for all odd (resp. even) $i < k$.

Suppose now that $M$ is a polynomial space bounded TM; i.e., for some polynomial $p(n)$,
$M$ uses at most $p(n)$ space on inputs of length $n$. Consider the languages $L_1$ and $L_2$ described
above, except that now we require that each configuration $c_i$ have length at most $p(|w|)$ and
that $k \leq 2^{p(|w|)}$ (since there are at most $2^{p(|w|)}$ distinct configurations in any computation of
$M$ on $w$). We can construct $G_1$ and $G_2$ as follows. We will actually describe the construction
of a PDA $M_1$ accepting $L_1$ (the construction for $L_2$ is similar).

First, let us observe that we can count in binary up to $2^{p(n)}$ on $M_1$'s stack by using
$O(p(n))$ states of the finite control. These states simply keep track of how many bits we are
pushing or popping when incrementing the counter.
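The counting idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's PDA: the stack is a Python list with the least significant bit on top, and the integer `popped` (which never exceeds `width`) stands in for the $O(p(n))$ finite-control states that remember how many bits have been popped. The names `increment` and `value` are hypothetical.

```python
def increment(stack, width):
    """Increment a width-bit binary counter stored on top of a stack.

    Only push and pop operations are applied to `stack`. Returns False
    (leaving the counter unchanged) if the counter is already all ones.
    """
    popped = 0
    # Pop trailing 1 bits, remembering only how many were popped.
    while popped < width and stack[-1] == 1:
        stack.pop()
        popped += 1
    if popped == width:
        stack.extend([1] * width)   # overflow: restore the counter
        return False
    stack.pop()                     # the lowest 0 bit becomes 1
    stack.append(1)
    stack.extend([0] * popped)      # the popped 1 bits become 0
    return True


def value(stack, width):
    """Read the counter as an integer (for checking only; a PDA cannot do this)."""
    bits = stack[-width:]
    return sum(b << i for i, b in enumerate(reversed(bits)))
```

Starting from `width` zeros, repeated calls to `increment` step through every value up to $2^{\text{width}} - 1$, after which the call reports overflow.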

We therefore recognize a word of the form $c_1 \# c_2^R \# \cdots \# c_{k-1} \# c_k^R \#$ as follows. We
maintain a binary counter on the stack that counts the number of $c_i$'s that we have currently
processed. Every time we encounter a new pair $c_i \# c_{i+1}^R$ we interrupt the current
computation on the stack (let's say we push a new temporary bottom of stack symbol onto the
stack) and we process $c_i \# c_{i+1}^R$ just as in the standard construction.

While reading each $c_i$ (or $c_i^R$), we must also verify that the length of $c_i$ is at most $p(n)$.
We do this by adding polynomially many states to the finite control of $M_1$; these states are
used to keep track of the length of each $c_i$ and to verify that this length does not exceed
$p(n)$.

After verifying that $c_{i+1}$ follows from $c_i$, the stack now once more only contains the
counter recording the number of $c_i$'s processed so far. We can now increment this counter
and continue to process the remaining pairs in the same manner. After reading all of the
input we verify by popping the stack that there were at most $2^{p(n)}$ $c_i$'s.

Observe that since the length of each $c_i$ in $L_1$ is at most $p(|w|)$ and since $k \leq 2^{p(|w|)}$, the
language $L_1$ consists of only finitely many words. Furthermore, the PDAs $M_1$ and $M_2$ can
be constructed in polynomial time.

Before proceeding further we require the following lemma, which appears to be part of
the folklore. A weaker result was stated without proof by Meyer and Fischer [6, Proof of
Proposition 5].

Lemma 5. Let $M$ be a PDA with $n$ states and a stack alphabet of size $s$ that accepts by
empty stack and that either pushes a single symbol onto the stack or pops a single symbol
from the stack on each move. If $M$ accepts a finite language, then for any input $w$ accepted
by $M$, there is an accepting computation for which the maximum stack height is at most $sn^2$.

Proof. Consider a shortest accepting computation of $M$ on an input $w$. Suppose that this
computation has maximum stack height $H > sn^2$ and that this height $H$ is reached after
exactly $T$ steps. For $i = 1, 2, \ldots, H$, let $l(i)$ denote the last time before time $T$ that the
computation had stack height $i$ and let $r(i)$ denote the first time after time $T$ that the
computation had stack height $i$. Let $C(i) = (p, A, q)$, where at time $l(i)$ the computation
was in state $p$ with $A$ on top of the stack and at time $r(i)$ the computation was in state $q$.
Since the stack height never dips below $i$ between times $l(i)$ and $r(i)$, the symbol on top of
the stack at time $r(i)$ is the same as at time $l(i)$. There are only $sn^2$ distinct triples $(p, A, q)$,
so $C(i) = C(j)$ for some $i < j$. We may therefore write $w = uvw'xy$ such that

• $u$ is the portion of the input processed after $l(i)$ steps;

• $uv$ is the portion of the input processed after $l(j)$ steps;

• $uvw'$ is the portion of the input processed after $r(j)$ steps;

• $uvw'x$ is the portion of the input processed after $r(i)$ steps.

However, we now see that $uv^t w' x^t y$ is accepted by $M$ for all positive integers $t$. Furthermore,
$vx \neq \epsilon$; i.e., the portions of the computation between times $l(i)$ and $l(j)$ and between times
$r(j)$ and $r(i)$ do not consist entirely of $\epsilon$-transitions. If indeed $vx = \epsilon$, then we could obtain
a shorter accepting computation of $M$ on $w$, contradicting the assumed minimality of this
computation. Thus $M$ accepts an infinite language, a contradiction. We conclude that the
maximum stack height of a shortest accepting computation is at most $sn^2$. □

We assume without loss of generality that all PDAs considered from now on accept by 
empty stack and either push a single symbol or pop a single symbol on each move. 

Theorem 6. The following problem is PSPACE-complete: "Given PDAs $A_1, A_2, \ldots, A_k$,
each over the alphabet $\Sigma$, and each accepting a finite language, does there exist $x \in \Sigma^*$ such
that $x$ is accepted by each $A_i$, $1 \leq i \leq k$?" The problem is PSPACE-complete even when
$k = 2$.

Proof. To show that the problem is in PSPACE, note that each $A_i$ has an equivalent CFG
$G_i$ whose size is bounded above by a polynomial in the size of $A_i$. Any word generated by
$G_i$ has length bounded above by a function exponential in the size of $G_i$.

We therefore give an NPSPACE algorithm as follows. Guess a word $w$ one symbol at a
time and simulate each $A_i$ in parallel on $w$. By Lemma 5, the total space required to store
the stack contents of $A_i$ during the simulation is at most $sn^2$, where $s$ is the size of $A_i$'s stack
alphabet and $n$ is the number of states of $A_i$. It follows that the total space required for
the parallel simulation of the $A_i$'s is polynomial in the combined size of the $A_i$'s. We reject
on any branch of the simulation that exceeds the bound on the stack height. There are $s^{sn^2}$
stack configurations total, so we can keep an $O(sn^2)$ size counter to detect if we enter an
infinite loop on $\epsilon$-transitions; if so we reject on this branch of the simulation as well. This
non-deterministic algorithm can then be determinized by Savitch's theorem.

To see that the problem is PSPACE-hard, it suffices to observe that given a polynomial
space bounded TM $M$ and a word $w$, we can construct the PDAs $M_1$ and $M_2$ described
above in polynomial time. The language $L(M_1) \cap L(M_2)$ is non-empty if and only if $M$
accepts $w$. This completes the reduction. □






Theorem 7. The following problem is PSPACE-complete: "Given a CFG G generating a 
finite language, does G generate a square?" 

Proof. To see that the problem is in PSPACE, recall that if $G$ generates a finite language,
there is an exponential bound on the length of the words in the language. We now convert
$G$ to a PDA $M$ in polynomial time. Let $M$ have $n$ states and a stack alphabet of size $s$. We
wish to guess the symbols of a word $w$ of length at most exponential in the size of $G$ and
verify that $M$ accepts $ww$. By Lemma 5, the maximum stack height of $M$ on input $ww$ is
at most $sn^2$. We therefore guess a configuration $C$ of $M$ of size $O(sn^2)$ and simulate two
copies of $M$ on the guessed symbols of $w$, the first starting from the initial configuration and
the second starting from the configuration $C$. If the first simulation ends in configuration $C$
and the second simulation ends in an accepting configuration, then $M$ accepts $ww$. Again,
we can determinize this construction by Savitch's theorem.

To see that the problem is PSPACE-hard, it suffices to observe that given a polynomial
space bounded TM $M$ and a word $w$, we can construct the CFGs $G_1$ and $G_2$ described above
in polynomial time. We can then construct a CFG $G$ to generate $L(G_1)\#L(G_2)\#$. Then
$L(G)$ contains a square if and only if $M$ accepts $w$. This completes the reduction. □

Theorem 8. The problem "Given a CFG $G$ and a pattern $p$, is there a non-erasing
morphism $h$ and a word $w$ in $L(G)$ such that $h(p)$ is a factor of $w$?" is PSPACE-complete.

Proof. To see that it is in PSPACE, note that the answer is always "yes" if $L(G)$ is infinite
(because then by the pumping lemma $L(G)$ contains words with arbitrarily high powers as
factors). We can check if $L(G)$ is infinite in polynomial time. If $L(G)$ is finite, then the
sizes of any morphism $h$ and word $w$ such that $h(p)$ is a factor of $w$ are bounded above by a
function exponential in the size of $G$. We can therefore guess the symbols of $w$, the lengths
of the images of $h$, and the starting position of $h(p)$ in $w$ in polynomial space. We may
then verify that $h(p)$ is a factor of $w$ in polynomial space by a procedure analogous to that
described in the proof of Theorem 7, which illustrated the method for the case of a pattern
$p = xx$ (i.e., a square) matching $w$ exactly.

We begin by converting the CFG $G$ to a PDA $M$. We then start guessing the symbols
of $w$ and simulating $M$ on $w$. Recall that by Lemma 5 this simulation only requires $O(sn^2)$
space. When we reach the guessed starting location of $h(p)$ in $w$, we record the current
configuration $C_0$ of the simulation and proceed as follows. Let $p = p_1 p_2 \cdots p_\ell$. We begin by
guessing configurations $C_1, C_2, \ldots, C_\ell$ such that for each $1 \leq j \leq \ell$, the simulation of $M$ goes
from configuration $C_{j-1}$ to $C_j$ upon reading $h(p_j)$. Note that since each configuration has
size $O(sn^2)$, we can record all of these guessed configurations in polynomial space. We verify
our guesses as follows. For each distinct symbol $x$ occurring in $p$, let $J_x = \{j : p_j = x\}$. For
each $j \in J_x$, we simulate (in parallel for all $j$) a copy of $M$ on the symbols of a guessed word
$h(x)$ (whose length we have previously guessed) starting in configuration $C_{j-1}$. Again, since
the stack height is bounded by a polynomial in the size of $G$, and since $|J_x|$ is at most $|p|$,
the total space required for these parallel simulations is polynomial in the input size. We
repeat these parallel simulations for all distinct symbols $x$ occurring in $p$. At this stage we
have guessed and verified the occurrence of $h(p)$ in $w$; we now guess the remaining symbols
of $w$ to complete the simulation.

To show PSPACE-hardness we reduce from DFA INTERSECTION. Suppose we are
given $k$ DFAs $M_1, M_2, \ldots, M_k$, the largest having $n$ states. We first observe that if a word $x$
is accepted in common by all of the $M_i$'s, there is such an $x$ of length at most $n^k$.

We now construct a CFG $G$ that generates a single squarefree word $w$ of length at least
$n^k$. Suppose we have a uniform morphism $h$ (over a 3-letter alphabet disjoint from those
of the $M_i$'s) that generates an infinite squarefree word. For example, we may take $h$ to be
defined by the map

0 → 0121021201210

1 → 1202102012021

2 → 2010210120102

(see [5]). We can generate an iterate of $h$ (i.e., one of the words $h(0), h^2(0), h^3(0), \ldots$) of
length at least $n^k$ with a grammar of size $O(k \log n)$. So we can construct the grammar $G$
generating $w$ in polynomial time.
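The following Python sketch illustrates why a small grammar suffices. It is one possible construction under stated assumptions, not necessarily the one the authors had in mind: for each level $1 \leq i \leq t$ and each letter $a$ there is a nonterminal, written here as the tuple `(i, a)`, that derives $h^i(a)$. Since $h$ is 13-uniform, $t = O(k \log n)$ levels are enough to reach length $n^k$. The function names are hypothetical.

```python
H = {"0": "0121021201210", "1": "1202102012021", "2": "2010210120102"}


def iterate(t, start="0"):
    """Compute h^t(start) directly (exponentially long in t)."""
    word = start
    for _ in range(t):
        word = "".join(H[c] for c in word)
    return word


def is_squarefree(word):
    """Brute-force check that no factor of word has the form xx, x non-empty."""
    n = len(word)
    for length in range(1, n // 2 + 1):
        for i in range(n - 2 * length + 1):
            if word[i:i + length] == word[i + length:i + 2 * length]:
                return False
    return True


def iterate_grammar(t):
    """Productions of a grammar with 3t nonterminals generating exactly h^t(0)."""
    rules = {}
    for level in range(1, t + 1):
        for a in "012":
            if level == 1:
                rules[(level, a)] = list(H[a])                  # terminals
            else:
                rules[(level, a)] = [(level - 1, b) for b in H[a]]
    return rules, (t, "0")


def expand(rules, symbol):
    """Derive the unique word generated from symbol."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])
```

Each production has a right-hand side of length 13, so the grammar has size linear in $t$ while the generated word has length $13^t$.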

We then construct a CFG $G'$ that generates all prefixes of words in $L(G)$. Next we
convert $G'$ to a PDA $N$ (in polynomial time). Given the PDA $N$ and a DFA $M_i$, we can
perform the standard construction to obtain a PDA $A_i$ accepting the perfect shuffle of $L(N)$
and $L(M_i)$. We do this so that the strings accepted by $A_i$ are all squarefree.
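A small Python sketch shows the effect of the perfect shuffle on squares. The order of interleaving and the function names are assumptions made for this illustration. The point is that interleaving any word with a squarefree word of the same length over a disjoint alphabet yields a squarefree word: a square $uu$ with $|u|$ odd would equate letters from different alphabets, and a square with $|u|$ even would project to a square in the squarefree component.

```python
def perfect_shuffle(x, y):
    """Interleave two words of equal length: x1 y1 x2 y2 ..."""
    if len(x) != len(y):
        raise ValueError("perfect shuffle needs words of equal length")
    return "".join(a + b for a, b in zip(x, y))


def has_square(word):
    """Brute-force test for a factor of the form uu with u non-empty."""
    n = len(word)
    return any(
        word[i:i + length] == word[i + length:i + 2 * length]
        for length in range(1, n // 2 + 1)
        for i in range(n - 2 * length + 1)
    )
```

For example, the square `abab` shuffled with the squarefree word `0120` gives `a0b1a2b0`, which contains no square.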

Now we convert all of the $A_i$'s into CFGs $B_i$ (we can do this in polynomial time). Next
we construct a grammar $C$ that generates the language $L = L(B_1)\#L(B_2)\# \cdots \#L(B_k)\#$.

The size of the grammar $C$ generating $L$ is just the sum of the sizes of the $B_i$'s, so it
remains polynomial in the combined sizes of the original DFAs. Now, since $L(B_i)$ consists
only of squarefree strings, a word $u$ in $L$ contains a $k$-power as a factor if and only if $u$ is
itself a $k$-power. (Note also that $L$ is a finite language.)

Recall that if there is an $x$ accepted by all the $M_i$'s, there is such an $x$ of length at most
$n^k$. Since the words in $L(G')$ have length at most $n^k$, there is such an $x$ if and only if a
string $z$ formed by the perfect shuffle of a word $x$ and a word in $L(G')$ is accepted by all of
the $A_i$'s. But this is true if and only if the string $(z\#)^k$ is generated by $C$. This in turn is
true if and only if $C$ generates a word with a $k$-power as a factor.

The entire construction can be done in polynomial time. This completes the reduction. 

□ 